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Abstract 

We present a novel multivariate classification technique based on Genetic Program- 
ming. The technique is distinct from Genetic Algorithms and offers several advan- 
tages compared to Neural Networks and Support Vector Machines. The technique 
optimizes a set of human-readable classifiers with respect to some user-defined per- 
formance measure. We calculate the Vapnik-Chervonenkis dimension of this class 
of learning machines and consider a practical example: the search for the Stan- 
dard Model Higgs Boson at the LHC. The resulting classifier is very fast to eval- 
uate, human-readable, and easily portable. The software may be downloaded at: 
http : / / cern . ch/~cranmer/PhysicsGP . html 

Key words: Genetic Programming, Triggering, Classification, VC Dimension, 
Genetic Algorithms, Neural Networks, Support Vector Machines 



1 Introduction 



The use of multivariate algorithms 
in the search for particles in High 
Energy Physics has become quite 
common. Traditionally, a search can 
be viewed from a classification point 
of view: from a tuple of physical mea- 
surements (i.e., momenta, energy, 
etc.) we wish to classify an event as 
signal or background. Typically, this 
classification is realized through a 
Boolean expression or cut designed 
by hand. The high dimensionality of 
the data makes this problem diffi- 
cult in general and favors more so- 



phisticated multivariate algorithms 
such as Neural Networks, Fisher Dis- 
criminants, Kernel Estimation Tech- 
niques, or Support Vector Machines. 
This paper focuses on a Genetic Pro- 
gramming approach and considers a 
specific example: the search for the 
Higgs Boson at the LHC. 

The use of Genetic Programming for 
classification is fairly limited; how- 
ever, it can be traced to the earl 
works on the subject by Koza [l| 
More recently, Kishore et al. ex- 
tended Koza's work to the muiticat- 
egory problem 0] . To the best of the 
authors' knowledge, the work pre- 
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sented in this paper is the first use of 
Genetic Programming within High 
Energy Physics. 

In Section 2 we provide a brief his- 
tory of evolutionary computation 
and distinguish between Genetic Al- 
gorithms (GAs) and Genetic Pro- 
gramming (GP). We describe our al- 
gorithm in detail for an abstract per- 
formance measure in Section 3 and 
discuss several specific performance 
measures in Section 4. 

Close attention is paid to the perfor- 
mance measure in order to leverage 
recent work applying the various re- 
sults of statistical learning theory in 
the context of new particle searches. 
This recent work consists of two com- 
ponents. In the first, the Neyman- 
Pearson theory is translated into the 
Risk formalism 0, 0]. The second 
component requires calculating the 
Vapnik-Chervonenkis dimension for 
the learning machine of interest. In 
Section 5, we calculate the Vapnik- 
Chervonenkis dimension for our Ge- 
netic Programming approach. 

Because evolution is an operation on 
a population, GP has some statisti- 
cal considerations not found in other 
multivariate algorithms. In Section 6 
we consider the main statistical is- 
sues and present some guiding princi- 
ples for population size based on the 
user-defined performance measure. 

Finally, in Section 7 we examine the 
application of our algorithm to the 
search for the Higgs Boson at the 
LHC. 



2 Evolutionary Computation 

In Genetic Programming (GP), a 
group of "individuals" evolve and 
compete against each other with re- 
spect to some performance measure. 
The individuals represent potential 
solutions to the problem at hand, 
and evolution is the mechanism by 
which the algorithm optimizes the 
population. The performance mea- 
sure is a mapping that assigns a 
fitness value to each individual. GP 
can be thought of clS cl Monte Carlo 
sampling of a very high dimensional 
search space, where the sampling is 
related to the fitness evaluated in the 
previous generation. The sampling 
is not ergodic - each generation is 
related to the previous generations 
- and intrinsically takes advantage 
of stochastic perturbations to avoid 
local extrema 1 . 

Genetic Programming is similar to, 
but distinct from Genetic Algo- 
rithms (GAs), though both methods 
are based on a similar evolutionary 
metaphor. GAs evolve a bit string 
which typically encodes parameters 
to a pre-existing program, function, 
or class of cuts, while GP directly 
evolves the programs or functions. 
For example, Field and Kanev 
used Genetic Algorithms to optimize 
the lower- and upper-bounds for six 
1-dimensional cuts on Modified Fox- 
Wolfram "shape" variables. In that 
case, the phase-space region was a 
pre-defined 6-cube and the GA was 



These are the properties that give 
power to Markov Chain Monte Carlo 
techniques. 
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simply evolving the parameters for 
the upper and lower bounds. On the 
other hand, our algorithm is not con- 
strained to a pre-defined shape or 
parametric form. Instead, our GP 
approach is concerned directly with 
the construction and optimization of 
a nontrivial phase space region with 
respect to some user-defined perfor- 
mance measure. 

In this framework, particular at- 
tention is given to the performance 
measure. The primary interest in the 
search for a new particle is hypoth- 
esis testing, and the most relevant 
measures of performance are the ex- 
pected statistical significance (usu- 
ally reported in Gaussian sigmas) or 
limit setting potential. The different 
performance measures will be dis- 
cussed in Section 4, but consider a 
concrete example: s/ y/b, where s and 
b are the number of signal and back- 
ground events satisfying the event 
selection, respectively. 



3 The Genetic Programming 
Approach 

While the literature is replete with 
uses of Genetic Programming and 
Genetic Algorithms, direct evolu- 
tion of cuts appears to be novel. 
In the case at hand, the individu- 
als are composed of simple arith- 
metic expressions, /, on the input 
variables v. Without loss of gen- 
erality, the cuts are always of the 
form —1 < f(v) < 1. By scal- 
ing, f(v) — > af(v), and translation, 
f(v) — > f(v)+b, of these expressions, 
single- and double-sided cuts can be 




Evaluated Expression 



Fig. 1. Signal and Background his- 
tograms for an expression. 

produced. An individual may consist 
of one or more such cuts combined by 
the Boolean conjunction AND. Fig. 1 
shows the signal and background 
distributions of four expressions that 
make up the most fit individual in a 
development trial. 

Due to computational considera- 
tions, several structural changes 
have been made to the naive imple- 
mentation. First, an Island Model 
of parallelization has been imple- 
mented (see Section 3.5). Secondly, 
individuals' fitness can be evaluated 
on a randomly chosen sub-sample of 
the training data, thus reducing the 
computational requirements at the 
cost of statistical variability. There 
are several statistical considerations 
which are discussed in Section 6. 

3. 1 Individual Structure, Mutation, 
and Crossover 

The genotype of an individual is a 
collection of expression trees similar 
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Fig. 2. An example of crossover. At 
some given generation, two parents (a) 
and (b) are chosen for a crossover mu- 
tation. Two subtrees, shown in bold, 
are selected at random from the par- 
ents and are swapped to produce two 
children (c) and (d) in the subsequent 
generation. 

to abstract syntax trees that might 
be generated by a compiler as an in- 
termediate representation of a com- 
puter program. An example of such a 
tree is shown in Fig. 2a which corre- 
sponds to a cut \4.2vi + t>2/1.5| < 1. 
Leafs are either constants or one of 
the input variables. Nodes are simple 
arithmetic operators: addition, sub- 
traction, multiplication, and safe di- 
vision 2 . When an individual is pre- 
sented with an event, each expression 
tree is evaluated to produce a num- 
ber. If all these numbers lie within the 
range (—1,1), the event is considered 
signal. Otherwise the event is classi- 
fied as background. 

Initial trees are built using the PTC1 
algorithm described in [fj. After each 
generation, the trees are modified by 



2 Safe division is used to avoid division 
by zero. 



mutation and crossover. Mutation 
comes in two flavors. In the first, a 
randomly chosen expression in an in- 
dividual is scaled or translated by a 
random amount. In the second kind 
of mutation, a randomly chosen sub- 
tree of a randomly chosen expression 
is replaced with a randomly gener- 
ated expression tree using the same 
algorithm that is used to build the 
initial trees. 

While mutation plays an important 
role in maintaining genetic diversity 
in the population, most new individ- 
uals in a particular generation result 
from crossover. The crossover opera- 
tion takes two individuals, selects a 
random subtree from a random ex- 
pression from each, and exchanges 
the two. This process is illustrated in 
Fig. 2. 



3.2 Recentering 



Some expression trees, having been 
generated randomly, may prove to 
be useless since the range of their 
expressions over the domain of their 
inputs lies well outside the interval 
(— 1, 1) for every input event. When 
an individual classifies all events 
in the same way (signal or back- 
ground), each of its expressions is 
translated to the origin for some ran- 
domly chosen event exemplar vq, viz. 
f{v) -> f{v) - f(v ). This modifi- 
cation is similar to, and thus reduces 
the need for, normalizing input vari- 
ables. 
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3. 3 Fitness Evaluation s 

I 
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Fitness evaluation consumes the ma- 
jority of time in the execution of the 
algorithm. So, for speed, the fitness 
evaluation is done in C. Each individ- 
ual is capable of expressing itself as a 
fragment of C code. These fragments 
are pieced together by the Python 
program, written to a file, and com- 
piled. After linking with the train- 
ing vectors, the program is run and 
the results communicated back to the 
Python program using standard out- 
put. 

The component that serializes the 
population to C and reads the results 
back from the generated C program 
is configurable, so that a user-defined 
performance measure may be imple- 
mented. 



3.4 Evolution & Selection Pressure 

After a given generation of individ- 
uals has been constructed and the 
individuals' fitnesses evaluated, a 
new generation must be constructed. 
Some individuals survive into the 
new generation, and some new indi- 
viduals are created by mutation or 
crossover. In both cases, the popu- 
lation must be sampled randomly. 
To mimic evolution, some selection 
pressure must be placed on the indi- 
viduals for them to improve. This se- 
lection pressure is implemented with 
a simple Monte Carlo algorithm and 
controlled by a parameter a > 1. 
The procedure is illustrated in Fig. 3. 
In a standard Monte Carlo algo- 




Fig. 3. Monte Carlo sampling of indi- 
viduals based on their fitness. A uni- 
form variate x is transformed by a sim- 
ple power to produce selection pressure: 
a bias toward individuals with higher 
fitness. 



rithm, a uniform variate x G [0, 1] 
is generated and transformed into 
the variable of interest by the in- 
verse of its cumulative distribution. 
Using the cumulative distribution of 
the fitness will exactly reproduce the 
population without selection pres- 
sure; however, this sampling can be 
biased with a simple transformation. 
The right plot of Fig. 3 shows a uni- 
form variate x being transformed 
into x 1 ^, which is then inverted (left 
plot) to select an individual with a 
given fitness. As the parameter a 
grows, the individuals with high fit- 
ness are selected increasingly often. 

While the selection pressure mech- 
anism helps the system evolve, it 
comes at the expense of genetic di- 
versity. If the selection pressure is 
too high, the population will quickly 
converge on the most fit individual. 
The lack of genetic diversity slows 
evolutionary progress. This behavior 
can be identified easily by looking at 
plots such as Fig. 4. We have found 
that a moderate selection pressure 
a G [1,3] has the best results. 
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3.5 Parallelization and the Island 
Model 



GP is highly concurrent, since differ- 
ent individuals' fitness evaluations 
are unrelated to each other, and di- 
viding the total population into a 
number of sub-populations is a sim- 
ple way to parallelize a GP problem. 
Even though this is a trivial modi- 
fication to the program, it has been 
shown that such coarse grained par- 
allelization can vield greater-than- 
linear speedup Our system uses 
a number of Islands connected to a 
Monitor in a star topology. CORBA 
is used to allow the Islands, which are 
distributed over multiple processors, 
to communicate with the Monitor. 

Islands use the Monitor to exchange 
particularly fit individuals each gen- 
eration. Since a separate monitor 
process exists, a synchronous ex- 
change of individuals is not neces- 
sary. The islands are virtually con- 
nected to each other (via the Moni- 
tor) in a ring topology. 



4 Performance Measures 



The Genetic Programming approach 
outlined in the previous section is a 
very general algorithm for producing 
individuals with high fitness, and it 
allows one to factorize the definition 
of fitness from the algorithm. In this 
section we examine the function(al) 
which assigns each individual its fit- 
ness: the performance measure. 

Before proceeding, it is worthwhile 
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Fig. 4. The fitness of the population as a 
function of time. This plot is analogous 
to a neural network error vs. epoch plot, 
with the notable exception that it de- 
scribes a population and not an individ- 
ual. In particular, the neural network 
graph is a 1-dimensional curve, but this 
is a two dimensional distribution. 

to compare GP to popular multivari- 
ate algorithms such as Support Vec- 
tor Machines and Neural Networks. 
Support Vector Machines typically 
try to minimize the risk of misclas- 
sification Y,i \y% ~ f(vi)\, where y { is 
the target output (usually or -1 
for background and 1 for signal) and 
f(vi) is the classification of the i th in- 
put. This is slightly different than the 
error function that most Neural Net- 
works with backpropagation attempt 
to minimize: £i \y { - f(vi)\ 2 jUsf. In 
both cases, this performance measure 
is usually hard-coded into a highly 
optimized algorithm and cannot be 
easily replaced. Furthermore, these 
two choices are not always the most 
appropriate for High Energy Physics, 
as will be discussed in Section 4.1. 

The most common performance 
measure for a particle search is the 
Gaussian significance, s/y/b, which 
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measures the statistical significance 
(in "sigmas") of the presence of a 
new signal. The performance mea- 
sure s I Vb is calculated by determin- 
ing how many signal events, s, and 
background events, b, a given indi- 
vidual will select in a given amount 
of data (usually measured in fb _1 ). 

The s/yb is actually an approxi- 
mation of the Poisson significance, 
ap, the probability that an expected 
background rate b will fluctuate to 
s + b. The key difference between the 
two is that as s, b — > 0, the Poisson 
significance will always approach 0, 
but the Gaussian significance may 
diverge. Hence, the Gaussian signifi- 
cance may lead to highly fit individ- 
uals that accept almost no signal or 
background events. 

The next level of sophistication in 
significance calculation is to include 
systematic error in the background 
only prediction b. These calcula- 
tions tend to be more difficult and 
the field has not adopted a stan- 
dard [Hi H , 13] . It is also quite com- 
mon to improve the statistical signif- 
icance of an analysis by including a 
discriminating variable 3 • 



In contrast, one may be more inter- 
ested in excluding some proposed 
particle. In that case, one may wish 
to optimize the exclusion potential. 
The exclusion potential and discov- 
ery potential of a search are related, 
and G. Punzi has suggested a perfor- 
mance measure which takes this into 
account quite naturally (l4| . 

Ideally, one would use as a perfor- 
mance measure the same procedure 



that will be used to quote the results 
of the experiment. For instance, there 
is no reason (other than speed) that 
one could not include discriminating 
variables and systematic error in the 
optimization procedure (in fact, the 
authors have done both). 



4-1 Direct vs. Indirect Methods 



Certain approaches to multivariate 
analysis leverage the many powerful 
theorems of statistics, assuming one 
can explicitly refer to the joint prob- 
ability density of the input variables 
and target values p(v,y). This de- 
pendence places a great deal of stress 
on the asymptotic ability to estimate 
p(v, y) from a finite set of samples 
{{v,y)i}. There are many such tech- 
niques for estimating a multivariate 
density function p(v, y) given the 
samples jla, llfij ]. Unfortunately, for 
high dimensional domains, the num- 
ber of samples needed to enjoy the 
asymptotic properties grows very 
rapidly; this is known as the curse of 
dimensionality. 

Formally, the statistical goal of a 
new particle search is to minimize 
the rate of Type II error. This is 
logically distinct from, but asymp- 
totically equivalent to, approximat- 
ing the likelihood ratio. In the case 
of no interference between the sig- 
nal and background, this is logically 
distinct from, but asymptotically 
equivalent to, approximating the 
signal-to-background ratio. In fact, 
most multivariate algorithms are 
concerned with approximating an 
auxiliary function that is one-to-one 



7 



with the likelihood ratio. Because the 
methods are not directly concerned 
with minimizing the rate of Type 
II error, they should be considered 
indirect methods. Furthermore, the 
asymptotic equivalence breaks down 
in most applications, and the indi- 
rect methods are no longer optimal. 
Neural Networks, Kernel Estimation 
techniques, and Support Vector Ma- 
chines all represent indirect solutions 
to the search for new particles. The 
Genetic Programming approach is a 
direct method concerned with opti- 
mizing a user-defined performance 
measure. 



5 Statistical Learning Theory 

In 1979, Vapnik provided a re- 
markable family of bounds relating 
the performance of a learning ma- 
chine and its generalization capac- 
ity fvi\ . The capacity, or Vapnik- 
Chervonenkis dimension (VCD) is 
a property of a set of functions, or 
learning machines, {f(v;a)}, where 
a is a set of parameters for the learn- 
ing machine . 

In the two-class pattern recogni- 
tion case considered in this paper, 
an event x is classified by a learn- 
ing machine such that f(v;a) G 
{signal, background}. Given a set of 
I events each represented by there 
are 2 l possible permutations of them 
belonging to the class signal or back- 
ground. If for each permutation there 
exists a member of the set {f(v; a)} 
which correctly classifies each event, 
then we say the set of points is shat- 
tered by that set of functions. The 




Fig. 5. The VCD for a line in R 2 is 3. 



VCD for a set of functions {f(v;a)} 
is defined as the maximum number 
of points which can be shattered by 
{f(v;a)}. If the VCD is h, it does 
not mean that every set of h points 
can be shattered, but that there ex- 
ists some set of h points which can 
be shattered. For example, a hyper- 
plane in M™ can shatter n+1 points 
(see Fig. 5 for n = 2). 

In the modern theory of machine 
learning, the performance of a learn- 
ing machine is usually cast in the 
more pessimistic setting of risk. In 
general, the risk, R, of a learning 
machine is written as 

R(a) = J Q(v, y; a) p(v, y) dvdy 

(1) 

where Q measures some notion of 
loss between f(v\ a) and the tar- 
get value y. For example, when 
classifying events, the risk of mis- 
classification is given by Eq. 1 with 
Q(v,y;a) = \y - f(v;a)\. Simi- 
larly, for regression tasks one takes 
Q(v, y; a) = (y - f(v; a)) 2 . Most of 
the classic applications of learning 
machines can be cast into this for- 
malism; however, searches for new 
particles place some strain on the 
notion of risk 0, Q| . 

The starting point for statistical 
learning theory is to accept that we 
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might not know p(v, y) in any an- 
alytic or numerical form. This is, 
indeed, the case for particle physics, 
because only {{v,y)i} can be ob- 
tained from the Monte Carlo con- 
volution of a well-known theoretical 
prediction and complex numerical 
description of the detector. In this 
case, the learning problem is based 
entirely on the training samples 
{{ViU)i} with I elements. The risk 
functional is thus replaced by the 
empirical risk functional 



^emp (c) 



-j^Q{vhVi]oc). (2) 

1 i=l 



There is a surprising result that the 
true risk R(a) can be bounded inde- 
pendent of the distribution p(v, y). In 
particular, for < Q(v, y; a) < 1 



R(a) <R emp 
+ 



a) 



(3) 



\ 



Mlog(2*/fe) + l)-log(77/4) \ 



where h is the VC dimension and r\ is 
the probability that the bound is vi- 
olated. As rj — > 0, h — > oo, or / — > 
the bound becomes trivial. The sec- 
ond term of the right hand side is of- 
ten referred to as the VC confidence 
- for h = 200, I = 10 5 , and 77 = 95% 
the VC confidence is about 12%. 

While the existence of the bounds 
found in Eq. 3 are impressive, they 
are frequently irrelevant. In partic- 
ular, for Support Vector Machines 
with radial basis functions for ker- 
nels the VCD is formally infinite 
and there is no bound on the true 
risk. Similarly, for Support Vector 



Machines with polynomial kernels 
of degree p and data embedded in d 
dimensions, the VCD is ( p+ p _1 ) + 1 
which grows very quickly. 

This motivates a calculation of the 
VCD of the GP approach. 



5.1 VCD for Genetic Programming 



The VC dimension, h, is a prop- 
erty of a fully specified learning 
machine. It is meaningless to cal- 
culate the VCD for GP in general; 
however, it is sensible if we pick a 
particular genotype. For the slightly 
simplified genotype which only uses 
the binary operations of addition, 
subtraction, and multiplication, all 
expressions are polynomials on the 
input variables. It has been shown 
that for learning machines which 
form a vector space over their pa- 
rameters, 3 the VCD is given by the 
dimensionality of the span of their 
parameters [19j . Because the Genetic 
Programming approach mentioned is 
actually a conjunction of many such 
cuts, one also must use the theorem 
that the VCD for Boolean conjunc- 
tions, b, among learning machines 
is given by VCD(6(/i, • • • , /*)) < 
Cfc maXjVCD(/,), where is a con- 
stant 0. 

If we placed no bound on the size of 
the program, arbitrarily large poly- 
nomials could be formed and the 



3 A learning machine, J-, is a vector 
space if for any two functions /, g € 
T and real numbers a, b the function 
af + bg £ T . Polynomials satisfy these 
conditions. 
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f(x,y;a) = a x 

+ a 4 • x ■ x 
+ a 7 ■ x ■ x ■ y 



+a 2 ■ x 
+a 5 • x ■ y 
+a 8 ■ x-y-y 



+ a 3 -y 

+ a 6 -y-y 

+ ag ■ x ■ x ■ y ■ y 



Fig. 6. An explicit example of the largest polynomial on two variables with degree 
two. In total, 53 nodes are necessary for this expression which has only 9 independent 
parameters. 



VCD would be infinite. However, by 
placing a bound on either the size 
of the program or the degree of the 
polynomial, we can calculate a sen- 
sible VCD. The remaining step nec- 
essary to calculate the VCD of the 
polynomial Genetic Programming 
approach is a combinatorial prob- 
lem: for programs of length L, what 
is the maximum number of linearly 
independent polynomial coefficients? 
Fig. 6 illustrates that the smallest 
program with nine linearly inde- 
pendent coefficients requires eight 
additions, eighteen multiplications, 
eighteen variable leafs, and nine con- 
stant leafs for a total of 53 nodes. A 
small Python script was written to 
generalize this calculation. 

The Genetic Programming approach 
with polynomial expressions has a 
relatively small VCDs (in our tests 
with seven variables nothing more 
than h = 100 was found) which 
affords the relevance of the upper- 
bound proposed by Vapnik. 

5. 2 VCD of Neural Networks 

In order to apply Eq. 3, one must de- 
termine the VC dimension of Neural 
Networks. This is a difficult prob- 
lem in combinatorics and geometry 



aided by algebraic techniques. Ed- 
uardo Sontag has an excellent re- 
view of these techniques and shows 
that the VCD of neural networks 
can, thus far, only be bounded fairly 
weakly In particular, if we de- 
fine p as the number of weights and 
biases in the network, then the best 
bounds are p 2 < h < p 4 . In a typical 
particle physics neural network one 
can expect 100 < p < 1000, which 
translates into a VCD as high as 10 12 , 
which implies I > 10 13 for reasonable 
bounds on the risk. These bounds 
imply enormous numbers of training 
samples when compared to a typical 
training sample of 10 5 . Sontag goes 
on to show that these shattered sets 
are incredibly special and that the 
set of all shattered sets of cardinality 
greater than p = 2p + 1 is mea- 
sure zero in general. Thus, perhaps a 
more relevant notion of the VCD of 
a neural network is given by p. 



6 Statistical Fluctuations in 
the Fitness Evaluation 

In this section we examine the trade- 
off between the time necessary to 
evaluate the fitness of an individ- 
ual and the accuracy of the fitness 
when evaluated on a finite sample of 
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events. Holding computing resources 
fixed, the two limiting cases are: 

1 With very large sample sizes, one 
can expect excellent estimation 
of the fitness of individuals and 
a clear "winner" at the expense 
of very little mutation and poorly 
optimized individuals. 

2 With very small sample sizes, one 
can expect many mutations lead- 
ing to individuals with very high 
fitness which do not perform as re- 
ported on larger samples. 

Illustrated in Fig. 7 is the distribu- 
tion of fitness for a given "winning" 
individual generated with a large en- 
semble of training samples each with 
400 events. The fitness reported in 
the last generation of the training 
phase (indicated with an arrow) is 
much higher than the mean of this 
distribution. In fact, the probability 
to have randomly chosen a sample 
of 400 events which would produce 
such a high empirical significance is 
about 0.1%. 

While the chance that an arbitrary 
individual's fitness evaluates several 
standard deviations from the mean is 
quite small, with thousands, maybe 
millions, of individual programs the 
chance that one will fluctuate signif- 
icantly can be quite large. Further- 
more, the winning individual has a 
much higher chance of a significant 
upward fluctuation, because individ- 
uals with upward fluctuations have a 
higher chance of being the winner. 

Having recognized that statistical 
fluctuations in the fitness evaluation 
complicate matters, we have devel- 
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Fig. 7. The distribution of fitness eval- 
uated on a single individual with an 
ensemble of testing samples (each with 
400 events). The dashed vertical arrow 
indicates the fitness evaluated with a 
specific testing sample in the last gener- 
ation of training. The large (~ 3<r) up- 
ward fluctuation in the evaluated fitness 
largely enhanced the chance this indi- 
vidual would be chosen as the winner. 

oped a few guiding principles for 
reliable use of the algorithm. 

• For training, the standard devia- 
tion of the fitness distribution eval- 
uated on N events should be on the 
order of a noticeable and marginal 
improvement in the fitness based 
on the users performance measure. 

• Select the winning individual with 
a large testing sample. 

If we take as our measure of perfor- 
mance s/ Vb, then it is possible to cal- 
culate the variance due to the fluctu- 
ations in s and b. The expected error 
is given by the standard propagation 
of errors formula. In particular, 



/ s V As 2 A6V 
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where s = La s e s and b = La b e b 
via standard rate calculations and 
As 2 = e s (l — e s )L 2 a 2 /N s and sim- 
ilarly Ab 2 = e b (l - e b )L 2 a 2 /N b via 
binomial statistics. For the analysis 
presented in Section 7, the selec- 
tion efficiency for signal and back- 
ground are e s ~ 50% and e b ~ 5%, 
respectively. The predicted rate of 
signal and background events are 
La s ~ 100, and La b ~ 560, respec- 
tively. Using these values, one can 
expect a 10% (5%) relative error on 
the fitness evaluation with a sample 
of N s = N b = 100 (400) events. Anal- 
ogous calculations can be made for 
any performance measure (though 
they may require numerical studies) 
to determine a reasonable sample 
size. The rule of thumb that relative 
errors scale as 1/y/N is probably 
reasonable in most cases. 



7 Case Study: The Search for 
the Higgs Boson at the LHC 

Finally we consider a practical ex- 
ample: the search for the Higgs bo- 
son at the LHC. While there are 
many channels available, the recent 
Vector Boson Fusion analyses offer 
a sufficiently complicated final state 
to warrant the use of multivariate 
algorithms. 

The Higgs at the LHC is produced 
predominantly via gluon-gluon fu- 
sion. For Higgs masses such that 
M H > 100 GeV the second domi- 
nant process is Vector Boson Fu- 
sion. The lowest order Feynman dia- 
gram of the production of Higgs via 
VBF is depicted in Fig 8. The decay 




Fig. 8. Tree-level diagram of Vector 
Boson Fusion Higgs production with 
H -> W+W~ -> l+l'uu 

channel chosen is H — > W + W~ — > 
e ± ^ =F z/I7, e + e~z/I7, fi + fi~vV. These 
channels will also be referred to as 
e/i, ee, and /i/z, respectively. 

These analyses were performed at 
the parton level and indicated that 
this process could be the most pow- 
erful discovery mode at the LHC in 
the range of the Higgs mass, M H , 
115 < M H < 200 GeV 0. These 
analyses were studied specifically 
in the ATLAS environment using a 
fast simulation of the detector j2l|. 
Two traditional cut analyses, one 
for a broad mass range and one op- 
timized for a low-mass Higgs, were 
develo ped and documented in refer- 
ences jm and [i^. We present re- 
sults from previous studies without 
systematic errors on the dominant ti 
background included. 

In order to demonstrate the potential 
for multivariate algorithms, a Neural 
Network analysis was performed |24[ . 
The approach in the Neural Network 
study was to present a multivariate 
analysis comparable to the cut anal- 
ysis presented in |22j . Thus, the anal- 
ysis was restricted to kinematic vari- 
ables which were used or can be de- 
rived from the variables used in the 
cut analysis. 
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The variables used were: 

• Arju - the pseudorapidity differ- 
ence between the two leptons, 

• A(j)u - the azimuthal angle differ- 
ence between the two leptons, 

• Mu - the invariant mass of the two 
leptons, 

• Arjjj - the pseudorapidity differ- 
ence between the two tagging jets, 

• A(j)jj - the azimuthal angle differ- 
ence between the two tagging jets, 

• Mjj - the invariant mass of the two 
tagging jets, and 

• Mt - the transverse mass. 

In total, three multivariate analyses 
were performed: 

• a Neural Network analysis using 
backpropagation with momentum, 

• a Support Vector Regression anal- 
ysis using Radial Basis Functions, 
and 

• a Genetic Programming analysis 
using the software described in 
this Communication. 

The Neural Network (NN) analysis 
is well documented in reference |24(. 
The analysis were performed with 
both the Stutgart Neural Network 
Simulator (SNNS) 4 and MLPfit 5 
with a 7-10-10-1 architecture. 

For the Support Vector Regression 
(SVR) analysis, the BSVM-2.0 6 li- 
brary was used. 



4 SNNS can be found here: 

www-ra . inf ormatik . uni-tuebingen . de 

5 MLPfit can be found here: 
cern. ch/~schwind/MLPf it .html 

6 BSVM can be found here: 

www . csie . ntu . edu/~c j lin/bsvm 



The only parameters are the cost 
parameter, set to C — 1000, and 
the kernel function. BSVM does not 
support weighted events, so an "un- 
weighted" signal and background 
sample was used for training. 



Because the trained machine only de- 
pends on a small subset of "Support 
Vectors", performance is fairly sta- 
ble after only a thousand or so train- 
ing samples. In this case, 2000 signal 
and 2000 background training events 
were used. 



Both NN and SVR methods produce 
a function which characterizes the 
signal-likeness of an event. A sep- 
arate procedure is used to find the 
optimal cut on this function which 
optimizes the performance measure 
(in this case the Poisson signficance, 
o>). Fig. 9 shows the distribution of 
the SVR (left) and NN (right) out- 
put values. The optimal cut for the 
SVR technique is shown vertical 
arrow. 



Tab. 1 compares the Poisson signifi- 
cance, ap, for a set of reference cuts, 
a set of cuts specifically optimized 
for low-mass Higgs, Neural Networks, 
Genetic Programming, and Support 
Vector Regression. It is very pleas- 
ing to see that the multivariate tech- 
niques achieve similar results. Each of 
the methods has its own set of advan- 
tages and disadvantages, but taken 
together the methods are quite com- 
plementary. 
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Signal 
Background 




SVR output 



Neural Network output 



Fig. 9. Support Vector Regression and Neural Network output distributions for signal 
and background for 130 GeV Higgs boson in the e[i channel. 



Ref. Cuts low-mjy Opt. Cuts NN GP SVR 



120 ee 
120 e/j 
120 fifi 
Combined 



0.87 
2.30 
1.16 
2.97 



1.25 
2.97 
1.71 
3.91 



1.72 1.66 1.44 

3.92 3.60 3.33 

2.28 2.26 2.08 

4.98 4.57 4.26 



130 en 4.94 6.14 7.55 7.22 6.59 

Table 1 

Expected significance for two cut analyses and three multivariate analyses for dif- 
ferent Higgs masses and final state topologies. Significance is expressed in terms of 
Gaussian Sigmas, but calculated with Poisson statistics. 



8 Conclusions 



We have presented an implemen- 
tation of a Genetic Programming 
system specifically applied to the 
search for new particles. In our ap- 
proach a group of individuals com- 
petes with respect to a user-defined 
performance measure. The genotype 
we have chosen consists of Boolean 
conjunctions of simple arithmetic 
expressions of the input variables re- 
quired to lie in the interval (—1,1). 
Our implementation includes an is- 
land model of parallelization and a 



recentering algorithm to dramati- 
cally improve performance. We have 
emphasized the importance of the 
performance measure and decoupled 
fitness evaluation from the optimiza- 
tion component of the algorithm. 
We have touched on the relationship 
of Statistical Learning Theory and 
VC dimension to the search for new 
particles and multivariate analysis 
in general. Finally, we have demon- 
strated that this method has simi- 
lar performance to Neural Networks 
(the de facto multivariate analysis 
of High Energy Physics) and Sup- 
port Vector Regression. We believe 
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that this technique's most relevant 
advantages are 

• the ability to provide a user- 
defined performance measure 
specifically suited to the problem 
at hand, 

• the speed with which the resulting 
individual / cut can be evaluated, 

• the fundamentally important abil- 
ity to inspect the resulting cut, and 

• the relatively low VC dimension 
which implies the method needs 
only a relatively small training 
sample. 
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